Introduction:

In this report, our aim is to analyze a crime dataset for Colchester covering 2024-2025 that consists of street-level crime incidents extracted from the UK Police data website. We will be delving into the trends, patterns and insights contained in the dataset.

We explore a dataset that spans a broad geographical and criminal scope: from low-level theft to serious crime, and from urban areas to rural ones. The dataset provides a detailed overview of crimes and their effects on different areas. By identifying trends and patterns, we can gain a better understanding of the main factors that affect crime rates, such as time, location and demographics. The analysis will help us develop strategies that contribute to a safer environment.

crime_data <- read.csv("~/Desktop/Data Visualization/crime2024-25.csv")
#Finding the column names of crime data
colnames(crime_data)
##  [1] "X"                "category"         "persistent_id"    "date"            
##  [5] "lat"              "long"             "street_id"        "street_name"     
##  [9] "context"          "id"               "location_type"    "location_subtype"
## [13] "outcome_status"
#Finding the dimensions of crime data
dim(crime_data)
## [1] 6047   13
#Finding the no. of rows
nrow(crime_data)
## [1] 6047
#Finding the no.of columns
ncol(crime_data)
## [1] 13

Descriptive Analysis:

We are now examining the crime dataset in detail to gain some useful insights. Firstly, we have a total of 6047 records in our dataset, which is a substantial amount. Moreover, we have 13 columns: an index column (X) plus 12 variables that represent important features of our crime data.

Variables: We now break down the dataset's columns into simple subgroups to provide a concise description:

A. Crime Information:

  1. category: The column represents the type of crime committed (e.g. burglary, criminal damage/arson, possession of weapons, shoplifting, etc.).

  2. persistent_id: The column represents a unique ID that identifies each crime incident.

  3. date: This column represents the date the incident took place.

B. Location

  1. lat & long: The latitude and longitude of the location where the crime occurred.
  2. street_id: The column represents the ID of the street where the crime occurred.
  3. street_name: The column represents the name of the street where the crime occurred.
  4. location_type: The column represents the type of location where the crime occurred.
  5. location_subtype: The column represents more detailed information about the subtype of the location.

C. Additional Information:

  1. context: This column represents additional information about the crime incident, i.e. the circumstances surrounding it.
  2. outcome_status: This column represents the outcome of the crime investigation.
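For a quick structural check of these variables, a compact overview can be printed; a small sketch, assuming crime_data has been loaded as above:

```r
# Compact per-column overview: type and a short preview of each variable
str(crime_data, vec.len = 1)

# First few records for a concrete look at the values
head(crime_data, 3)
```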

PRE-PROCESSING:

Now let's preprocess the data to ensure that our dataset is ready for visualization and analysis. I started by exploring the missing values, looking at their proportion in our crime dataset, and therefore I created a table that reflects the proportion of missing values for each variable.

#Identifying the missing values
missing_values <- sapply(crime_data, function(x) sum(is.na(x)))

#Verifying if there are any missing values
if (any(missing_values>0)){
  missing_portion <- missing_values/nrow(crime_data)

#Creating a summary table
missing_summary <- data.frame(
  Variables = names(crime_data),
  Missing_Values = missing_values,
  Proportion_Missing = missing_portion
)
#Printing statistics
print(missing_summary)
} else {
  print("No missing values or NA values found")
}
##                         Variables Missing_Values Proportion_Missing
## X                               X              0           0.000000
## category                 category              0           0.000000
## persistent_id       persistent_id              0           0.000000
## date                         date              0           0.000000
## lat                           lat              0           0.000000
## long                         long              0           0.000000
## street_id               street_id              0           0.000000
## street_name           street_name              0           0.000000
## context                   context           6047           1.000000
## id                             id              0           0.000000
## location_type       location_type              0           0.000000
## location_subtype location_subtype              0           0.000000
## outcome_status     outcome_status            668           0.110468

As per the above analysis, the proportion of missing values for the ‘context’ variable is 1, which means that there isn’t a single value of context in our dataset. Therefore, it won’t contribute to our analysis or visualization at all, so we shall drop this column entirely.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Drop the column with missing values
new_data <- select(crime_data, -context)

Having analysed the dataset and performed pre-processing, we shall now dive deeper into the subparts of the crime dataset.

We begin with the most important piece of information: the crime categories. We will look at how much each category accounts for in our dataset, i.e. the distribution of crime categories. For this, we shall create a table that reflects the frequency of each crime category.

#Calculating the frequency table 
category_table <- table(crime_data$category)
category_table
## 
## anti-social-behaviour         bicycle-theft              burglary 
##                   668                   151                   157 
## criminal-damage-arson                 drugs           other-crime 
##                   466                   231                    91 
##           other-theft possession-of-weapons          public-order 
##                   399                    58                   451 
##               robbery           shoplifting theft-from-the-person 
##                    81                   643                    84 
##         vehicle-crime         violent-crime 
##                   253                  2314
#Calculating the category percentages
category_percentage <- round(100*prop.table(category_table),2)
category_percentage
## 
## anti-social-behaviour         bicycle-theft              burglary 
##                 11.05                  2.50                  2.60 
## criminal-damage-arson                 drugs           other-crime 
##                  7.71                  3.82                  1.50 
##           other-theft possession-of-weapons          public-order 
##                  6.60                  0.96                  7.46 
##               robbery           shoplifting theft-from-the-person 
##                  1.34                 10.63                  1.39 
##         vehicle-crime         violent-crime 
##                  4.18                 38.27

Using the above table, we can observe the percentage of occurrences of each crime category. The most frequently committed crime is ‘violent-crime’, which accounts for 38.27%; the second most frequent is ‘anti-social-behaviour’ at 11.05%; and the third most frequent is ‘shoplifting’ at 10.63%. These insights will help the authorities allocate resources more efficiently. For example, they could assign more police officers to areas where violent crime is highest. Moreover, they could investigate the drivers of anti-social behaviour and take measures to reduce it, such as guidance programmes and engagement activities for the people involved.

All in all, examining the percentage of the type of crime helps in revealing important patterns and trends and helps the authorities in making informed decisions.

We shall now explore the data from a new angle to gain deeper insights. Our dataset has a column called ‘date’, which consists of the month and year of the crime. We will examine the date against the category. For this, I created a contingency table of category by date; the resulting table is shown in the output.

#Creating a two way table
date_category_table <- table(crime_data$category,crime_data$date)
date_category_table
##                        
##                         2024-04 2024-05 2024-06 2024-07 2024-08 2024-09 2024-10
##   anti-social-behaviour      70      80      63      53      58      58      56
##   bicycle-theft              12       6       9      12       9      12      19
##   burglary                   10      13       9      18      16       8      17
##   criminal-damage-arson      43      63      44      51      39      33      33
##   drugs                      25      12      12      17      19      25      21
##   other-crime                10      12       6       7       9       6      12
##   other-theft                34      41      34      33      35      32      38
##   possession-of-weapons       5       8       6       5       7       5       3
##   public-order               33      32      42      49      53      39      37
##   robbery                     6       7       9      10       7      10       8
##   shoplifting                40      59      42      58      37      47      64
##   theft-from-the-person       6       8       7      12       8       4       7
##   vehicle-crime              14      13      15      41      52      17      27
##   violent-crime             163     214     192     242     184     223     195
##                        
##                         2024-11 2024-12 2025-01 2025-02 2025-03
##   anti-social-behaviour      56      44      41      45      44
##   bicycle-theft              29      15       9       6      13
##   burglary                   25      10       6      10      15
##   criminal-damage-arson      30      27      28      39      36
##   drugs                      19      34      18      15      14
##   other-crime                 4       6      10       5       4
##   other-theft                30      36      31      26      29
##   possession-of-weapons       2       4       6       1       6
##   public-order               36      24      24      40      42
##   robbery                     6       0       7       8       3
##   shoplifting                74      61      50      69      42
##   theft-from-the-person       8       9       5       2       8
##   vehicle-crime              13      13      14      19      15
##   violent-crime             177     209     159     180     176

The contingency table above breaks down crime incidents by month and category. Firstly, ‘violent-crime’ is the most frequent crime throughout the entire year, with monthly counts ranging roughly from 150 to 250. Other categories such as ‘anti-social-behaviour’, ‘shoplifting’ and ‘burglary’ show notable frequencies, though they generally occur far less often than violent crime. Categories like ‘possession-of-weapons’, ‘robbery’ and ‘theft-from-the-person’ occur comparatively rarely. In addition, some categories such as ‘vehicle-crime’ show marked fluctuations: in August 2024 the count was 52, whereas in December 2024 it was 13, which points to seasonal changes in criminal behavior. Furthermore, ‘robbery’ is fairly consistent throughout the year, with zero incidents only in December 2024, which might indicate a lower robbery rate over the holiday season. All in all, these observations can help guide targeted action, distribution of resources, and policy making to address crime effectively and reduce its risks.
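To make comparisons like these easier, row and column totals can be added to the contingency table; a small sketch, reusing the date_category_table created above:

```r
# Append row/column totals ("Sum") to the two-way table
addmargins(date_category_table)

# Monthly totals on their own, to spot busier and quieter months
colSums(date_category_table)
```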

#Loading library
library(ggplot2)
#sorting the categories in terms of frequency
sort_categories <- names(sort(category_table, decreasing = TRUE))

#Converting category to a factor 
crime_data$category <- factor(crime_data$category,levels=sort_categories)

#Creating a barchart in decreasing order of frequency
ggplot(crime_data, aes(x = reorder(category, as.numeric(category)), fill = category)) +
  geom_bar() +
  labs(title = "Frequency of Crime Category", x = NULL, y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))

The bar plot above provides a clear view of the frequency of the different crime categories, with violent crime at the top, followed by anti-social behaviour and shoplifting. The visualization is aimed at various stakeholders, including law enforcement agencies, police departments, healthcare facilities and the public. Since we have presented the data in descending order of frequency, the plot helps identify the most common crime issues in the community, supporting informed decision-making.

#Loading libraries
library(dplyr)
#Loading libraries
library(ggplot2)
count_outcome <- crime_data %>% 
  group_by(category, outcome_status) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'category'. You can override using the
## `.groups` argument.
#Plotting enhanced bar plot of outcomes against crime
ggplot(count_outcome, aes(x=category, y=count, fill = outcome_status)) +
  geom_bar(stat = "identity", position = "stack") +
  labs (title = "Outcomes against Crime",
        x= "Category",
        y= "Count",
        fill = "Outcome Status")+
  theme_classic() + 
  theme(
    axis.text.x = element_text(angle=45, hjust=1),
        plot.title = element_text(hjust = 0.07))

The bar plot above provides a detailed overview of crime categories and their outcomes, offering valuable insights that are not immediately apparent from the dataset. We have used color to map outcome status within each crime category, enabling complex information to be taken in at a quick glance.

All in all, these visualizations can help the authorities make informed decisions, allocate resources effectively and implement strategies to address crime, contributing to a safer environment.

#Loading libraries
library(ggplot2)
#Loading libraries
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
#Converting the date column to yearmon Format
crime_data$Month <- as.yearmon(crime_data$date)

#Creating a bar chart of monthly crime counts with colored bars
ggplot(crime_data, aes(x= Month, fill = factor(Month))) +
  geom_bar(color= "red") +
  labs(title = "Frequency of Crime Incidents by Every Month",
       x= "Month",
       y = "Frequency") +
  scale_x_yearmon(format = "%b")+
  theme(plot.title = element_text(hjust = 0.5))

The bar chart above shows the number of crimes that occurred each month, which shall help the authorities make informed decisions. For example, July has the highest number of crime incidents, at more than 600, which may prompt increased police patrols or a programme to find the underlying cause. At the other end, the drop in crime in January, to approximately 400, invites investigation into why the number is so low; possible explanations include seasonal trends, weather patterns or new law enforcement measures. In conclusion, by plotting the number of crimes against months, we get a detailed idea of how to enhance public safety and what majorly impacts criminal behavior.
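The months singled out above can be checked numerically; a minimal sketch, assuming the date column still holds the year-month strings as loaded:

```r
# Count incidents per month and locate the extremes
monthly_counts <- table(crime_data$date)
monthly_counts[which.max(monthly_counts)]  # busiest month
monthly_counts[which.min(monthly_counts)]  # quietest month
```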

The scatter plot below shows the locations of crimes, with latitude and longitude on the y and x-axis respectively, offering a compelling visual narrative of spatial crime patterns. Each point represents the specific location where a crime was committed, and each color represents a different category of crime.

#Creating the scatter plot
ggplot(crime_data, aes(x=long, y=lat, color=category))+
  geom_point() +
  labs(title = "Location of Crime",
       x= "Longitude",
       y= "Latitude") +
  scale_color_discrete(name = "Crime Category")

By examining the plot, clusters of densely packed points emerge, suggesting areas with a higher incidence of criminal activity, while the empty spaces indicate areas with low or no recorded crime. These insights can empower authorities to make informed decisions with respect to new laws, resource allocation and community outreach.

By identifying the hot spots of criminal activity, police or law agencies can make informed decisions in a way they can deploy patrols to deter and prevent crime in high risk areas.

Moreover, law agencies and community leaders can use this information to combat crime in specific neighborhoods. The spatial analysis provided by the scatter plot thus equips the authorities to prioritize their responses, ultimately creating safer and more secure communities.
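One simple way to surface candidate hotspots from the same data is to rank streets by incident count; a sketch using dplyr, assuming crime_data is loaded as above:

```r
library(dplyr)

# Ten streets with the most recorded incidents
crime_data %>%
  count(street_name, sort = TRUE) %>%
  head(10)
```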

-SINA PLOT

I’ve plotted the spatial data with 2D density contours overlaid (referred to here as a sina plot). Longitude and latitude are displayed on the x and y axes, and the contours highlight the areas of slightly higher crime density.

library(ggplot2)
#Creating Sina plot
ggplot(crime_data, aes(x = long, y = lat)) +
  geom_point(alpha = 0.5) +
  geom_density_2d(color = "red") +
  labs(title = "Sina Plot of location",
       x = "Longitude",
       y = "Latitude") +
  theme_minimal()

The contour lines indicate areas with a higher density of recorded crimes. Regions with dense contour lines are referred to as “hotspots” due to their higher crime incident density; in contrast, areas with sparse or absent contour lines are “cold spots” with lower crime density. Law enforcement organizations, city planners and legislators can make better decisions with a thorough understanding of crime incident density. This supports more efficient resource allocation, focused crime-prevention initiatives, and the implementation of tactics to lower crime rates in densely affected areas.

-Time Series Plot:

I have created a time series graph showing how the frequency of crime changes over time, with smoothing applied to make the trend easier to examine. The x-axis displays the dates from April 2024 to March 2025.

#Loading libraries
library(dplyr)
library(ggplot2)
library(zoo)
#Creating the time series graph
#Converting the date column to a proper Date format
crime_data$date <- as.yearmon(crime_data$date)

#Aggregating data by month
crime_counts <- crime_data %>%
  group_by(date) %>%
  summarise(crime_count = n())

ggplot(crime_counts, aes(x=date, y=crime_count)) +
  geom_line() +
  geom_smooth(method = "loess", se= FALSE, color = "blue") +
  labs(title = "Crime Frequency Over Time with smoothing",
       x = "Date",
       y = "Number of Crimes") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The plot covers April 2024 to March 2025, with the y-axis showing the number of crimes per month. There appears to be a pattern of more crimes occurring during the summer months (May 2024 to July 2024). Overall, the highest monthly crime count is found in July 2024, while the lowest is found in January 2025.

#Calculating correlation matrix
correlation_matrix <- cor(crime_data[, c('lat','long','street_id')])
print(correlation_matrix)
##                   lat        long   street_id
## lat        1.00000000 -0.12775122 -0.02948315
## long      -0.12775122  1.00000000  0.03228818
## street_id -0.02948315  0.03228818  1.00000000
#Creating correlation heatmap
library(corrplot)
## corrplot 0.95 loaded
corrplot(correlation_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45)

- HEATMAP DATA

Based on the heatmap, every variable shows a perfect correlation of one with itself, as expected. The correlation analysis revealed some interesting connections between the variables. A small negative correlation was found between latitude and longitude, suggesting that latitude tends to decrease slightly as longitude increases and vice versa; the weakness of this association indicates that latitude and longitude are largely independent of each other. Moreover, street_id had very weak connections with latitude and longitude, with correlation coefficients close to zero, implying no meaningful relationship between street ID and the geographic coordinates. Overall, the weak correlations involving street ID, along with the minor association between latitude and longitude, highlight the general independence of street ID from the dataset’s geographic variables.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#Creating the scatter plot with colors mapped to crime categories
int_scat <- ggplot(crime_data, aes(x = long, y = lat, color = category,
                                   text = paste("Crime Category: ", category))) +
  geom_point() +
  labs(title = "Crime Locations", x = "Longitude", y = "Latitude") +
  scale_color_discrete(name = "Crime Category")

#Converting to plotly object
int_scat <- ggplotly(int_scat)

int_scat
library(plotly)

#Creating the bar chart of outcome status
int_histogram <- ggplot(crime_data, aes(x=outcome_status))+
  geom_bar(fill = "blue") +
  labs(title = "No. of outcome status",
       x = "Outcome Status",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5))

#Convert the ggplot to a plotly object 
int_histogram <- ggplotly(int_histogram)
#Display the interactive plot
int_histogram

Mapping crime data reveals important spatial patterns and highlights areas with high crime incidence. These insights can help law enforcement agencies and governments optimize resource allocation and implement targeted interventions directly addressing criminal activity. Crime maps not only show where crimes occur but also offer potential explanations for why they are concentrated in specific locations. Additionally, visualizing crime data supports better collaboration among various stakeholders—such as police departments, local authorities, and NGOs—enabling the development of more comprehensive, coordinated strategies to prevent crime and enhance public safety.

#Load Libraries
library(leaflet)
library(dplyr)
#creating a leaflet map
crime_map <- leaflet(crime_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~long,
    lat = ~lat,
    radius = 5,
    color = ~category,
    fillOpacity = 0.8,
    popup = ~paste("Category: ", category)
  )

crime_map

Studying a set of criminal offenses involves multiple steps that contribute to a comprehensive understanding of criminal behavior and support informed decision-making. The process begins with exploring the structure and features of the dataset. Pre-processing tasks, such as handling missing values, help ensure data integrity and improve the reliability of the findings.

To uncover trends, patterns, and associations within the data, various visualizations are employed, including tables, two-way tables, bar plots, histograms, Sina plots and scatter plots. These graphical representations play a crucial role in identifying crime hotspots, understanding temporal variation, and analyzing relationships between variables - valuable insights for strategic planning, tactical decision-making, and efficient resource allocation.

Correlation analysis enhances the process by revealing potential factors that contribute to rising crime rates. Interactive plots and maps also promote user engagement, enabling dynamic exploration of spatial and temporal crime patterns. All in all, these analytical approaches empower stakeholders such as law enforcement to make evidence-based decisions and develop effective crime-prevention strategies.

CLIMATE DATA

Introduction: This report presents an analysis of climate data collected from a weather station near Colchester. Our aim is to explore the region’s climate by identifying the trends, patterns and key insights within the dataset. By examining and visualizing the data related to temperature, precipitation, wind and humidity, we can gain a clear understanding of weather patterns, seasonal variation, and long-term climate change in the Colchester area.

The analysis enables stakeholders to identify climate-related risks, vulnerabilities and opportunities for adaptation. The insights derived from the data can inform well-founded decisions across various sectors, including infrastructure planning, disaster preparedness, tourism and agriculture. All in all, this contributes to enhancing the resilience of Colchester and its surrounding areas in the face of climate change.

#Loading the data
climate_data <- read.csv("~/Desktop/Data Visualization/temp2024-25.csv")
#finding the dimensions
dim(climate_data)
## [1] 365  18
#finding the no.of rows
nrow(climate_data)
## [1] 365
#finding the no.of columns
ncol(climate_data)
## [1] 18
#finding the column names
colnames(climate_data)
##  [1] "station_ID"      "Date"            "TemperatureCAvg" "TemperatureCMax"
##  [5] "TemperatureCMin" "TdAvgC"          "HrAvg"           "WindkmhDir"     
##  [9] "WindkmhInt"      "WindkmhGust"     "PresslevHp"      "Precmm"         
## [13] "TotClOct"        "lowClOct"        "SunD1h"          "VisKm"          
## [17] "SnowDepcm"       "PreselevHp"

Let’s take a closer look at the contents of this climate dataset to uncover some meaningful insights. We begin with a simple descriptive analysis to understand the basic characteristics of the data. The dataset contains a total of 365 records, providing a full year of observations - a solid foundation for analysis.

There are 18 variables in the dataset, each representing an important aspect of climate measurement. Below is a brief overview of the variables:

- station_ID: Identifier for the weather station where the data was collected.
- Date: The date on which the weather observations were recorded.
- TemperatureCAvg: Average temperature recorded on the date.
- TemperatureCMax: Maximum temperature recorded on the date.
- TemperatureCMin: Minimum temperature recorded on the date.
- TdAvgC: Average dew point temperature.
- HrAvg: Average relative humidity.
- WindkmhDir: Wind direction recorded on that date.
- WindkmhInt: Average wind speed (in kilometres per hour).
- WindkmhGust: Maximum wind gust speed (in kilometres per hour).
- PresslevHp: Atmospheric pressure (in hectopascals).
- Precmm: Total precipitation (in millimeters).
- TotClOct: Total cloud cover observed.
- lowClOct: Low cloud cover observed.
- SunD1h: Sunshine duration (in hours).
- VisKm: Visibility (in kilometers).
- SnowDepcm: Snow depth (in centimeters).
- PreselevHp: Pressure at station elevation (in hectopascals).
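As a first numerical look at these variables, five-number summaries of the temperature columns can be printed; a small sketch, assuming climate_data is loaded as above:

```r
# Min, quartiles, mean and max for the three temperature variables
summary(climate_data[, c("TemperatureCAvg",
                         "TemperatureCMax",
                         "TemperatureCMin")])
```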

Now let’s move on to preprocessing the data, a critical step to ensure the dataset is ready for effective visualization and analysis. I began by investigating the presence of missing values in the climate dataset. To do this, I generated a table that displays the proportion of missing values for each variable, helping to identify which features may require cleaning or imputation before further analysis.

# Identifying the missing values
missing_val <- sapply(climate_data, function(x) sum(is.na(x)))

# Determine the proportion of missing values
missing_proportion <- missing_val / nrow(climate_data)

# Creating a summary table
missing_summary <- data.frame(Variables = names(climate_data),
                              Missing_Values = missing_val,
                              Proportion_Missing = missing_proportion)

#Printing the summary of statistics
print(missing_summary)
##                       Variables Missing_Values Proportion_Missing
## station_ID           station_ID              0        0.000000000
## Date                       Date              0        0.000000000
## TemperatureCAvg TemperatureCAvg              0        0.000000000
## TemperatureCMax TemperatureCMax              0        0.000000000
## TemperatureCMin TemperatureCMin              0        0.000000000
## TdAvgC                   TdAvgC              0        0.000000000
## HrAvg                     HrAvg              0        0.000000000
## WindkmhDir           WindkmhDir              0        0.000000000
## WindkmhInt           WindkmhInt              0        0.000000000
## WindkmhGust         WindkmhGust              0        0.000000000
## PresslevHp           PresslevHp              0        0.000000000
## Precmm                   Precmm             23        0.063013699
## TotClOct               TotClOct              0        0.000000000
## lowClOct               lowClOct              9        0.024657534
## SunD1h                   SunD1h              1        0.002739726
## VisKm                     VisKm              0        0.000000000
## SnowDepcm             SnowDepcm            350        0.958904110
## PreselevHp           PreselevHp            365        1.000000000

From the results, we learn that the missing proportion for ‘PreselevHp’ is 1 and for ‘SnowDepcm’ is 0.96. This means these columns have almost all of their values missing, so they won’t contribute to our analysis at all. Accordingly, I dropped both columns entirely and they won’t be part of the report.

updated_data <- climate_data[, !(colnames(climate_data) %in% c("PreselevHp", "SnowDepcm"))]
colnames(updated_data)
##  [1] "station_ID"      "Date"            "TemperatureCAvg" "TemperatureCMax"
##  [5] "TemperatureCMin" "TdAvgC"          "HrAvg"           "WindkmhDir"     
##  [9] "WindkmhInt"      "WindkmhGust"     "PresslevHp"      "Precmm"         
## [13] "TotClOct"        "lowClOct"        "SunD1h"          "VisKm"
#Two Way Table
temp_two_way_table <- climate_data %>%
  group_by(station_ID, Date) %>%
  summarise(
    TemperatureCAvg = mean(TemperatureCAvg, na.rm = TRUE),
    TemperatureCMax = mean(TemperatureCMax, na.rm = TRUE),
    TemperatureCMin = mean(TemperatureCMin, na.rm = TRUE),
    .groups = "drop"
  )
#View(temp_two_way_table)

To visualize the climate data, I made a very simple visualization of the distribution of wind direction using a bar plot. After flipping the coordinates, the x-axis shows how many times the wind blew from a specific direction, and the y-axis shows the different wind directions.

library(ggplot2)
#Creating the bar plot

ggplot(climate_data, aes(x = factor(WindkmhDir))) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Distribution of Wind Direction",
       x = "Wind Direction",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        axis.title = element_text(size = 12, face = "bold"),
        axis.text = element_text(size = 10),
        legend.position = "none") +
  coord_flip()

From the bar plot above, it is quite evident that the SW (southwest) direction had the highest frequency of occurrence among all observed wind directions in the dataset.
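This can be confirmed directly from the frequency table; a sketch against the same WindkmhDir column:

```r
# Wind directions ordered from most to least frequent
sort(table(climate_data$WindkmhDir), decreasing = TRUE)
```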

To visualize the distribution of wind intensity and wind gust intensity, I combined both variables into a single column and created a unified histogram. The x-axis represents wind intensity values, while the y-axis indicates their frequency of occurrence.

library(ggplot2)
library(tidyr)
#Combining the data of WindkmhInt & WindkmhGust
combined_data <- climate_data %>%
  pivot_longer(cols = c(WindkmhInt, WindkmhGust), names_to = "Variable", values_to = "Windkmh")

#Histogram of the combined data
ggplot(combined_data, aes(x= Windkmh, fill = Variable)) +
  geom_histogram(binwidth = 5, position = "dodge", color = "black") +
  labs(title = "Distribution of Wind Gust Intensity and Wind Intensity",
       x= "Wind Intensity",
       y = "Frequency",
       fill = "Variable")+
  scale_fill_manual(values = c("skyblue", "salmon"))+
  theme_minimal()

Analysing the above histogram, the bar near the 50 km/h wind-intensity range reaches a height of roughly 26, meaning approximately 26 observations fall in that bin.
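The height of an individual bar can be checked by counting the observations inside the corresponding bin. A sketch with illustrative speeds: with `binwidth = 5`, a bin centred on 50 km/h would span roughly [47.5, 52.5), though the exact bin edges ggplot2 picks depend on its boundary settings.

```r
# Illustrative wind speeds; the real data would be combined_data$Windkmh
speeds <- c(12, 48, 50, 52, 49, 51, 14, 73, 50.5, 47)

# Count observations falling in the bin around 50 km/h
in_bin <- sum(speeds >= 47.5 & speeds < 52.5)
in_bin   # 6 of the toy values fall in this bin
```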

# Creating a box plot of Average Temperature by Wind Direction

ggplot(data = climate_data, aes(x = factor(WindkmhDir), y = TemperatureCAvg)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Box Plot of Average Temperature",
       x = "Wind Direction",
       y = "Average Temperature")

The boxplot above was plotted to better understand how wind direction relates to average temperature. It is quite evident that westerly winds show a wider range of temperatures than the easterly winds (E, ENE, NE and NNE). This suggests that westerly winds are associated with fluctuating temperatures, while easterly winds are linked to a more consistent temperature range.
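The visual impression of spread can be quantified with the interquartile range of temperature within each wind direction, which mirrors the box heights in the plot. A minimal sketch on a toy data frame (illustrative values only; the real call would use `climate_data`):

```r
# Toy data standing in for climate_data (illustrative values only)
toy <- data.frame(
  WindkmhDir      = c("W", "W", "W", "W", "E", "E", "E", "E"),
  TemperatureCAvg = c(2, 8, 15, 20, 9, 10, 10, 11)
)

# Interquartile range of average temperature per wind direction
spread_by_dir <- tapply(toy$TemperatureCAvg, toy$WindkmhDir, IQR)
spread_by_dir   # W shows a much larger IQR than E in this toy example
```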

#Creating a scatter plot

ggplot(climate_data, aes(x=TemperatureCAvg, y=TdAvgC)) +
  geom_point()+
  labs(title = "Scatter Plot Temperature vs Dew Point Temperature",
       x = "Average Temperature (degC)",
       y = "Average Dew Point Temperature (degC)")

I made a scatter plot to visualize average temperature against average dew point temperature. The pattern of the points suggests that there is a relationship between the variables: a positive correlation, with average dew point temperature increasing as average temperature increases.
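That visual impression can be quantified with Pearson's correlation coefficient; a sketch with illustrative vectors (the equivalent call on the real data would be `cor(climate_data$TemperatureCAvg, climate_data$TdAvgC, use = "complete.obs")`):

```r
# Illustrative temperature and dew point values (degC)
temp <- c(9.2, 9.6, 8.3, 10.0, 8.3, 12.0)
dew  <- c(2.7, 3.0, 1.8, 6.3, 4.4, 6.8)

# Pearson correlation: a clearly positive value confirms the upward trend
r <- cor(temp, dew)
```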

#Defining the variables selected
selected_variables <- c("TemperatureCAvg", "WindkmhInt", "PresslevHp")

#Creating a subset of the dataset
selected_data <- climate_data[, selected_variables]

#Calculating and displaying the correlation matrix
correlation_matrix <- cor(selected_data)
correlation_matrix
##                 TemperatureCAvg    WindkmhInt PresslevHp
## TemperatureCAvg    1.0000000000 -0.0003929393 -0.2217310
## WindkmhInt        -0.0003929393  1.0000000000 -0.3952205
## PresslevHp        -0.2217309859 -0.3952204817  1.0000000
library(corrplot)
corrplot(correlation_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45)

I performed a correlation analysis on average temperature, wind intensity and pressure, and then created a heatmap of the correlation matrix to help visualize the coefficients. In the heatmap above, the light red colour indicates a weak negative correlation: as temperature increases, pressure tends to decrease, although the relationship is not strong. Wind and pressure also have a weak negative correlation, with higher winds associated with low-pressure systems. Temperature and wind show little or no correlation.
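One caveat worth noting: `cor()` returns `NA` whenever a pair of values contains a missing one, so for columns that do contain NAs the call needs `use = "complete.obs"`. A minimal illustration:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)

cor(x, y)                        # NA: the missing value propagates
cor(x, y, use = "complete.obs")  # 1: computed on the four complete pairs
```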

#Convert Date column to Date Format
climate_data$Date <- as.Date(climate_data$Date)

#Check the structure of data
str(climate_data)
## 'data.frame':    365 obs. of  18 variables:
##  $ station_ID     : int  3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
##  $ Date           : Date, format: "2025-03-31" "2025-03-30" ...
##  $ TemperatureCAvg: num  9.2 9.6 8.3 10 8.3 8.6 6.5 9 10.8 12 ...
##  $ TemperatureCMax: num  15.8 13.6 14.3 16.4 15.2 12.9 12.7 13.9 16.1 14.9 ...
##  $ TemperatureCMin: num  3 2.9 2.9 2.6 2.6 1.2 1.2 5.2 5.2 7.3 ...
##  $ TdAvgC         : num  2.7 3 1.8 6.3 4.4 6.8 3.1 6.9 7.8 6.8 ...
##  $ HrAvg          : num  66.8 64.6 66.2 78.4 77.8 88.1 81.5 86.9 82.8 70.7 ...
##  $ WindkmhDir     : chr  "NW" "WSW" "WNW" "SW" ...
##  $ WindkmhInt     : num  20.2 21.5 19.1 18.7 11 10.3 12.4 16.6 10.8 16.6 ...
##  $ WindkmhGust    : num  53.7 50 40.8 38.9 35.2 29.7 31.5 35.2 31.5 40.8 ...
##  $ PresslevHp     : num  1023 1018 1014 1016 1024 ...
##  $ Precmm         : num  0 0 0 0 0 0 0.2 1 0.2 0.2 ...
##  $ TotClOct       : num  1.5 3.6 3.1 1.3 0.9 6.4 2.1 7.3 6.7 4.2 ...
##  $ lowClOct       : num  4.5 6.2 6.2 8 3 7.7 6.4 7.3 7 7.3 ...
##  $ SunD1h         : num  9.6 10.1 3.9 11.4 10.9 4.1 6.4 0.1 3.1 5.1 ...
##  $ VisKm          : num  29.2 47.5 49.4 19.9 40.5 11.5 6.7 4.3 16.5 18.5 ...
##  $ SnowDepcm      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PreselevHp     : logi  NA NA NA NA NA NA ...
ggplot(climate_data, aes(x=Date, y=TemperatureCAvg)) +
  geom_line() +
  labs (title = "Time Series of Average Temperature",
        x = "Date",
        y = "Average Temperature") +
  theme_minimal()

I created a time series plot of average temperature over time: the x-axis represents the date and the y-axis the temperature. The graph covers a 12-month period from April 2024 to April 2025. It is quite evident that the temperature rises in spring and summer and then falls during the winter. The average temperature varied from about -2 to 18 degrees Celsius over the year.

climate_data$Date <- as.Date(climate_data$Date)

#Adding smoothing to the time series graph
ggplot(climate_data, aes(x = Date, y = TemperatureCAvg)) +
   #Adding the line
  geom_line() +        
  # Adding smoothing
  geom_smooth(method = "loess", se = FALSE) +  
  labs(title = "Time Series of Average temperature with Smoothing",
       x = "Date",
       y = "Average Temperature") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

- Histogram with Smoothed Curve

ggplot(climate_data, aes(x = TemperatureCAvg)) +
  geom_histogram(binwidth = 1, fill = "yellow", color = "black") +
  geom_density(aes(y = after_stat(count)), fill = "red", alpha = 0.5) +
  labs(title = "Average Temperature (With Smoothed Density Curve)",
       x = expression("Average Temperature (" * degree * "C)"),
       y = "Frequency") +
  theme_minimal()

library(ggplot2)
library(tidyr)
library(plotly)

#Combining the WindkmhInt and WindkmhGust variable
combining_data <- climate_data %>%
  pivot_longer(cols = c(WindkmhInt, WindkmhGust), names_to = "Variable", values_to = "Windkmh")

#Creating a histogram
int_histm <- ggplot(combining_data, aes(x=Windkmh, fill = Variable)) +
  geom_histogram(binwidth = 5,position = "dodge", color = "red") +
  labs(title = "Distribution of Wind intensity and Wind Gust intensity",
       x = "Wind Intensity (km/h)",
       y = "Frequency",
       fill = "Variable") +
  scale_fill_manual(values = c ("skyblue","salmon")) +
  theme_minimal()

#Converting the ggplot to plotly
int_histm <- ggplotly(int_histm)

#Display the histogram
int_histm

The above histogram illustrates the distribution of two wind-related variables, WindkmhInt (regular wind intensity) and WindkmhGust (wind gust intensity), both measured in kilometres per hour. WindkmhInt values are concentrated in the lower range, between 10 and 25 km/h, with a peak around 15 km/h. In contrast, WindkmhGust values are more spread out, with higher frequencies in the 25-45 km/h range and extending beyond 75 km/h. This suggests that wind gusts tend to be stronger and more variable than regular wind intensities.

library(ggplot2)
library(plotly)

climate_data$Date <- as.Date(climate_data$Date)

int_time <- ggplot(climate_data, aes(x=Date, y=TemperatureCAvg)) +
  geom_line() +
  geom_smooth(method = "loess", se= FALSE) +
  labs(title = "Time Series of Average Temperature with Smoothing",
       x = "Date",
       y = "Average Temperature") +
  theme_minimal()

int_time <- ggplotly(int_time)
## `geom_smooth()` using formula = 'y ~ x'
int_time
library(dplyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
#Reading Data
crime_data <- read.csv("~/Desktop/Data Visualization/crime2024-25.csv", stringsAsFactors = FALSE)
climate_data <- read.csv("~/Desktop/Data Visualization/temp2024-25.csv", stringsAsFactors = FALSE)

# Converting dates to monthly (year-month) format
crime_data$Month <- format(as.Date(paste0(crime_data$date, "-01")), "%Y-%m")

# Summarising monthly crime count
monthly_crime <- crime_data %>%
  group_by(Month) %>%
  summarise(CrimeCount = n())

# Format climate dates
climate_data$Date <- as.Date(climate_data$Date)
climate_data$Month <- format(climate_data$Date, "%Y-%m")

# Summarising monthly average temperature using the correct column name
monthly_climate <- climate_data %>%
  group_by(Month) %>%
  summarise(AvgTemp = mean(TemperatureCAvg, na.rm = TRUE))

# Joining data sets
crime_weather <- left_join(monthly_crime, monthly_climate, by = "Month")

# Printing results
print(crime_weather)
## # A tibble: 12 × 3
##    Month   CrimeCount AvgTemp
##    <chr>        <int>   <dbl>
##  1 2024-04        471    9.08
##  2 2024-05        568   13.4 
##  3 2024-06        490   14.3 
##  4 2024-07        608   16.5 
##  5 2024-08        533   18.1 
##  6 2024-09        519   14.7 
##  7 2024-10        537   11.7 
##  8 2024-11        509    7.24
##  9 2024-12        492    6.50
## 10 2025-01        408    3.45
## 11 2025-02        465    4.46
## 12 2025-03        447    6.96
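With the monthly table assembled, the apparent co-movement of crime counts and temperature can be quantified; a sketch using the values printed above (the equivalent call on the joined data would be `cor(crime_weather$CrimeCount, crime_weather$AvgTemp)`):

```r
# Monthly values transcribed from the table above (April 2024 - March 2025)
crime <- c(471, 568, 490, 608, 533, 519, 537, 509, 492, 408, 465, 447)
temp  <- c(9.08, 13.4, 14.3, 16.5, 18.1, 14.7, 11.7, 7.24, 6.50, 3.45, 4.46, 6.96)

# Pearson correlation of monthly crime counts with average temperature;
# a positive value indicates warmer months tend to record more incidents
cor(crime, temp)
```

With only 12 monthly observations this is suggestive rather than conclusive, but it gives a single number summarising the seasonal pattern visible in the table.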

Two Way Table

Note that the crime data records only the year and month of each incident, so date_full below is set to the first day of each month; the weekday in the table therefore reflects the weekday on which each month began, not the day of the actual incident.

#Converting the date
crime_data$date_full <- as.Date(paste0(crime_data$date, "-01"))

#Use wday() 
table(crime_data$category, wday(crime_data$date_full, label = TRUE))
##                        
##                         Sun Mon Tue Wed Thu Fri Sat
##   anti-social-behaviour 102 123  56 121  58  56 152
##   bicycle-theft          27  24  19  15   9  29  28
##   burglary               18  28  17  19  16  25  34
##   criminal-damage-arson  60  94  33  91  39  30 119
##   drugs                  59  42  21  30  19  19  41
##   other-crime            12  17  12  22   9   4  15
##   other-theft            68  67  38  72  35  30  89
##   possession-of-weapons   9  10   3  14   7   2  13
##   public-order           63  82  37  56  53  36 124
##   robbery                10  16   8  14   7   6  20
##   shoplifting           108  98  64 109  37  74 153
##   theft-from-the-person  13  18   7  13   8   8  17
##   vehicle-crime          30  55  27  27  52  13  49
##   violent-crime         432 405 195 373 184 177 548

- Conclusion

The comprehensive study of the climate dataset provided valuable insights into key variables, enhancing our understanding of local climate dynamics. The descriptive analysis offered a solid foundation by illustrating the dataset’s characteristics, while thorough pre-processing ensured data integrity. The use of two-way tables, bar plots, histograms, scatter plots, time-series plots and correlation analysis effectively highlights patterns, trends and relationships within the data. The visualizations revealed temperature changes, precipitation levels and variations in wind speed over the year, enabling identification of seasonal fluctuations and potential correlations. Furthermore, the incorporation of interactive plots made the analysis more accessible, fostering a deeper connection between viewers and the data. All in all, this research serves as a valuable reference for informed decision-making across various industries by uncovering weather-related patterns, ultimately allowing a proactive approach to risks and opportunities.